========================================================

Introduction

We analyze the red wine quality dataset in R, applying exploratory data analysis techniques to explore the relationships in the data from different angles: one variable, two variables, and multiple variables. We also look at the distributions of the variables and at outliers.

We load all the libraries used in the analysis.
library(ggplot2)
library(GGally)
library(dplyr)
library(gridExtra)

Univariate Plots Section

We load the dataset winequalityred using the read.csv function in R.
We explore the dataset by looking at the variable names and other features, as shown below.
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
##     X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1   1           7.4             0.70        0.00            1.9     0.076
## 2   2           7.8             0.88        0.00            2.6     0.098
## 3   3           7.8             0.76        0.04            2.3     0.092
## 4   4          11.2             0.28        0.56            1.9     0.075
## 5   5           7.4             0.70        0.00            1.9     0.076
## 6   6           7.4             0.66        0.00            1.8     0.075
## 7   7           7.9             0.60        0.06            1.6     0.069
## 8   8           7.3             0.65        0.00            1.2     0.065
## 9   9           7.8             0.58        0.02            2.0     0.073
## 10 10           7.5             0.50        0.36            6.1     0.071
##    free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                   11                   34  0.9978 3.51      0.56     9.4
## 2                   25                   67  0.9968 3.20      0.68     9.8
## 3                   15                   54  0.9970 3.26      0.65     9.8
## 4                   17                   60  0.9980 3.16      0.58     9.8
## 5                   11                   34  0.9978 3.51      0.56     9.4
## 6                   13                   40  0.9978 3.51      0.56     9.4
## 7                   15                   59  0.9964 3.30      0.46     9.4
## 8                   15                   21  0.9946 3.39      0.47    10.0
## 9                    9                   18  0.9968 3.36      0.57     9.5
## 10                  17                  102  0.9978 3.35      0.80    10.5
##    quality
## 1        5
## 2        5
## 3        5
## 4        6
## 5        5
## 6        5
## 7        5
## 8        7
## 9        7
## 10       5
summary(RW$X)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   400.5   800.0   800.0  1200.0  1599.0
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
From this preliminary look, the dataset has 1599 red wine instances and 13 variables per instance, one of which (X) is just the row index. All of the variables are numeric. Even at first glance, some variables appear to have outliers.
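
The inspection steps above can be sketched roughly as follows. The file name is an assumption (the report only calls the dataset winequalityred), so the read.csv call on a file is left commented out, and three rows copied from the printed output stand in for the full data so the snippet runs on its own.

```r
# Assumed file name; adjust to wherever the CSV is actually stored.
# RW <- read.csv("winequalityred.csv")

# Stand-in data: the first three rows (a subset of columns) printed above.
csv_text <- "X,fixed.acidity,volatile.acidity,citric.acid,quality
1,7.4,0.70,0.00,5
2,7.8,0.88,0.00,5
3,7.8,0.76,0.04,5"
RW <- read.csv(text = csv_text)

names(RW)       # variable names
head(RW, 10)    # first rows (only three exist in this stand-in)
str(RW)         # dimensions and column types
summary(RW$X)   # X is the row index, so its summary is not informative
```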

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

The plot above shows the distribution of wine quality in the dataset. Most red wine instances have an average quality, as both the plot and the summary of the quality variable show. To simplify the analysis, we create a new categorical variable named rate from the discrete quality variable.

As can be seen above, most red wines fall into the average rate, as noted in the previous comment.
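
One way to build a variable like rate is with cut(). The report does not state the cut points, so the ones below (3-4 poor, 5-6 average, 7-8 excellent) are an assumption made only for illustration; the quality values are the first ten rows printed earlier.

```r
# Quality scores of the first ten rows shown earlier.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Assumed cut points (not given in the report):
# (0, 4] -> poor, (4, 6] -> average, (6, 10] -> excellent
rate <- cut(quality,
            breaks = c(0, 4, 6, 10),
            labels = c("poor", "average", "excellent"),
            ordered_result = TRUE)

table(rate)   # counts per category
```

Keeping rate as an ordered factor lets plots and tables list the categories from poor to excellent rather than alphabetically.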

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90
The plot above shows the distribution of alcohol concentration across the red wine instances.
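
A univariate plot of this kind could be produced along the following lines. This is a sketch, not the original plotting code: the bin width and axis labels are assumptions, and the ten alcohol values printed earlier stand in for the full dataset.

```r
library(ggplot2)

# Alcohol values of the first ten rows shown earlier (stand-in data).
wine <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5)
)

summary(wine$alcohol)   # numeric summary to read alongside the plot

# Histogram of alcohol; binwidth = 0.1 is an assumed choice.
p <- ggplot(wine, aes(x = alcohol)) +
  geom_histogram(binwidth = 0.1) +
  labs(x = "Alcohol", y = "Count")
print(p)
```

The same pattern applies to the other variables by swapping the column passed to aes().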

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000
The graph above shows the distribution of citric acid across the red wine instances. Most values lie below 0.50, and there is an outlier at 1.00.
The plot below shows the distribution of fixed acidity across all instances.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90
Residual sugar concentrations are low in the red wine data, as can be seen in the graph below: most instances have a residual sugar below 3, with values concentrated around 2. The distribution is therefore not normal.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500
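
The summary values alone already hint at the skew and the outliers. The sketch below uses only the numbers printed above; the 1.5 * IQR fence is Tukey's common rule of thumb, brought in here for illustration and not something the report itself applies.

```r
# Summary of residual.sugar, copied from the output above.
q1 <- 1.9; med <- 2.2; avg <- 2.539; q3 <- 2.6; max_value <- 15.5

# A mean above the median is a hint of a right-skewed distribution.
right_skew_hint <- avg > med

# Rule of thumb: values beyond Q3 + 1.5 * IQR are flagged as outliers.
iqr <- q3 - q1
upper_fence <- q3 + 1.5 * iqr

right_skew_hint           # is the mean above the median?
upper_fence               # fence implied by the quartiles
max_value > upper_fence   # does the maximum lie beyond the fence?
```

With the full data frame, quantile() and IQR() would give the same quartiles directly from the column.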
The graph below shows the distribution of volatile acidity in the red wine dataset. It looks bimodal, with the two modes between 0.3 and 0.7, and there are some outliers in the higher ranges.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800
The plot below shows the distribution of chlorides in the red wine dataset; it is not normally distributed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
The distribution of free sulfur dioxide does not appear to be normal, as can be seen below.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00
The distribution of total sulfur dioxide does not appear to be normal either, as can be seen below.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00
The density variable appears to be normally distributed, as can be seen in the graph below.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040
The pH distribution resembles that of density and also appears to be normal.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010
The sulphates distribution for the red wine instances is not normal.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000
The graph below shows the distribution of alcohol in the red wine dataset; it is not normal.

Univariate Analysis

What is the structure of your dataset?

The dataset contains 1599 red wine instances and 13 variables. We added one categorical variable, rate, that represents the quality of the red wine. All variables are numeric except the one we created. Most red wines have an average rate.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is the quality of the red wine, especially in relation to alcohol. I want to see whether these two features are correlated, and how quality relates to the other features.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

I will investigate other features such as pH, density, and acidity (citric.acid, fixed.acidity). In addition, residual.sugar and total.sulfur.dioxide might affect how people perceive the taste of the red wine.

Did you create any new variables from existing variables in the dataset?

Yes, I created the rate variable from the quality variable to simplify the analysis.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

I investigated most of the features. Two of the plots above, for citric.acid and alcohol, showed unusual distributions. We will investigate these features further in combination with other variables.

Bivariate Plots Section

We investigated the relationship between alcohol and quality because I was curious whether one exists. The graph above shows that the quality of red wine is related to its alcohol concentration.
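
A comparison like this is often drawn as one boxplot of alcohol per quality score. The sketch below assumes that plot type (the report does not say which one was used) and runs on the ten rows printed earlier, so it will not reproduce the full-data picture.

```r
library(ggplot2)

# First ten rows shown earlier (stand-in for the full dataset).
wine <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Treat quality as a discrete grouping variable: one box per score.
p <- ggplot(wine, aes(x = factor(quality), y = alcohol)) +
  geom_boxplot() +
  labs(x = "Quality", y = "Alcohol")
print(p)

# Pearson correlation on the same stand-in rows.
r <- cor(wine$alcohol, wine$quality)
r
```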

The plot above shows no correlation between rate (the quality of the red wine instances) and volatile acidity.

The plot above shows the relationship between fixed acidity and citric acid. Their correlation is 0.6717034.

The plot above shows no relationship or correlation between the chlorides and quality variables.

The plot above shows no clear relationship between the density and quality variables; however, excellent-quality wines have low density.

The previous plots show how the citric.acid and density variables relate to pH. Both variables are related to pH.

The graph above shows a strong relationship between total and free sulfur dioxide.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

We investigated the relationships between the variables in the red wine dataset and found the following. There appears to be a relationship between the quality of the wine and its alcohol concentration. Citric acid is also positively related to quality: the higher the citric acid concentration, the better the quality tends to be, although the correlation computed later in the report (0.23) is weak rather than strong.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

We checked the relationships between other features, such as pH with density and citric acid, and total sulfur dioxide with free sulfur dioxide. There are some relationships between these variables; in particular, total and free sulfur dioxide are strongly related, as can be seen in the graphs above.

What was the strongest relationship you found?

The strongest relationships I found were between total and free sulfur dioxide (correlation 0.67) and between citric acid and fixed acidity (also about 0.67).

Multivariate Plots Section

cor(RW$free.sulfur.dioxide,RW$total.sulfur.dioxide)
## [1] 0.6676665

The plot above shows the correlation between free and total sulfur dioxide in the red wine data.

The graph above shows the relationship between alcohol and total sulfur dioxide, together with the rate of each red wine instance.

cor.test(RW$citric.acid, RW$fixed.acidity)
## 
##  Pearson's product-moment correlation
## 
## data:  RW$citric.acid and RW$fixed.acidity
## t = 36.234, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6438839 0.6977493
## sample estimates:
##       cor 
## 0.6717034
cor.test(RW$citric.acid, RW$quality)
## 
##  Pearson's product-moment correlation
## 
## data:  RW$citric.acid and RW$quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725
cor.test(RW$fixed.acidity, RW$quality)
## 
##  Pearson's product-moment correlation
## 
## data:  RW$fixed.acidity and RW$quality
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07548957 0.17202667
## sample estimates:
##       cor 
## 0.1240516
The graph above shows a strong relationship between fixed acidity and citric acid; their correlation is 0.67. When I first tried to estimate the correlation of citric acid or fixed acidity with quality, I got an error because the quality variable was not numeric. It is stored as an integer, so I converted it to numeric, as can be seen above. The correlation of quality with citric acid (0.2263725) turned out to be stronger than with fixed acidity (0.1240516).
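
The conversion step might look like the sketch below. The column name quality_num is hypothetical, and the ten rows printed earlier stand in for the full data, so the estimates will differ from the ones reported above. Note that base R's cor.test() accepts integer vectors as they are; an explicit conversion is mainly needed when the column has been turned into a factor, in which case as.numeric(as.character(x)) recovers the original scores.

```r
# First ten rows shown earlier (stand-in for the full dataset).
wine <- data.frame(
  citric.acid   = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0.02, 0.36),
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Explicit numeric copy of quality (hypothetical column name).
wine$quality_num <- as.numeric(wine$quality)

# Pearson correlation tests of each acidity measure against quality.
res_citric <- cor.test(wine$citric.acid, wine$quality_num)
res_fixed  <- cor.test(wine$fixed.acidity, wine$quality_num)

res_citric$estimate
res_fixed$estimate
```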

The previous plot shows the relationship between density and pH together with red wine quality. A low pH can go with both high density and excellent quality, but it can also go with poor quality. This suggests that other variables also influence the quality of red wine.
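
A three-variable view like this can be sketched as a scatter plot coloured by rate. The plot type, the point transparency, and the cut points used to rebuild rate are all assumptions, and the data are the ten rows printed earlier.

```r
library(ggplot2)

# First ten rows shown earlier (stand-in for the full dataset).
wine <- data.frame(
  pH      = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35),
  density = c(0.9978, 0.9968, 0.9970, 0.9980, 0.9978,
              0.9978, 0.9964, 0.9946, 0.9968, 0.9978),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Assumed cut points for rate (not stated in the report).
wine$rate <- cut(wine$quality, breaks = c(0, 4, 6, 10),
                 labels = c("poor", "average", "excellent"))

# Density against pH, with colour carrying the third variable.
p <- ggplot(wine, aes(x = pH, y = density, colour = rate)) +
  geom_point(alpha = 0.6) +
  labs(x = "pH", y = "Density", colour = "Rate")
print(p)
```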

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

We looked at different relationships between the variables in the red wine dataset and found that fixed acidity and citric acid are strongly correlated with each other. There is also a relationship between density and pH together with red wine quality: a low pH can go with both high density and excellent quality, but it can also go with poor quality. This suggests that other variables also influence the quality of red wine.

Were there any interesting or surprising interactions between features?

Yes, there are interesting interactions between total sulfur dioxide and alcohol, even though there are some outliers among the excellent-quality red wines. There are also interesting interactions between the citric acid and fixed acidity variables.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

Plot one shows the relationship between fixed acidity and citric acid. Their correlation is 0.6717034.

Plot Two

Description Two

Plot two shows the relationship between density and pH together with red wine quality. A low pH can go with both high density and excellent quality, but it can also go with poor quality. This suggests that other variables also influence the quality of red wine.

Plot Three

Description Three

Plot three shows a strong relationship between fixed acidity and citric acid, which are strongly correlated (0.67). I also estimated the correlation of each with quality and found that citric acid is more strongly correlated with quality (0.2263725) than fixed acidity is (0.1240516).


Reflection

We investigated the red wine dataset, which has 1599 instances and 13 variables, and created one categorical variable to represent the rate of the red wine quality. One-variable, two-variable, and multi-variable plots were created throughout the investigation. We found that many variables are correlated with each other, such as citric acid and fixed acidity. In addition, many other factors may affect the quality of red wine, such as total sulfur dioxide and alcohol. We ran into some issues when calculating the correlation between quality and other variables, so we had to convert quality from integer to numeric, as we did above, to estimate the correlation between the chemical factors and the quality of the red wine. This project was a great exercise and lesson for me. There is still a lot I want to do, such as a correlation matrix and a heatmap, but that may have to wait for future work and courses.
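
As a pointer for that future work, a correlation matrix and a basic heatmap need only base R. This is a minimal sketch on the ten rows printed earlier with a subset of the columns, so the values are illustrative only; the choice of columns and the heatmap settings are assumptions.

```r
# First ten rows shown earlier, subset of columns (stand-in data).
wine <- data.frame(
  fixed.acidity        = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid          = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0.02, 0.36),
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102),
  pH                   = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35),
  alcohol              = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality              = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pairwise Pearson correlations, rounded for readability.
corr <- round(cor(wine), 2)
corr

# Heatmap of the matrix; no clustering or rescaling, so cells map directly
# to the correlations above.
heatmap(corr, Rowv = NA, Colv = NA, symm = TRUE, scale = "none")
```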